
95 research outputs found

On the PATHGROUPS approach to rapid small phylogeny

Author: A Caprara
AC Siepel
AW Xu
C Zheng
Chunfang Zheng
D Sankoff
David Sankoff
E Tannier
G Fertin
KP Byrne
N El-Mabrouk
R Warren
S Yancopoulos
SM Hedtke
Z Adam
Publication venue: BioMed Central
Publication date: 01/01/2011

We present a data structure enabling rapid heuristic solution to the ancestral genome reconstruction problem for given phylogenies under genomic rearrangement metrics. The efficiency of the greedy algorithm is due to fast updating of the structure during run time and a simple priority scheme for choosing the next step. Since accuracy deteriorates for sets of highly divergent genomes, we investigate strategies for improving accuracy and expanding the range of data sets where accurate reconstructions can be expected. These include a more refined priority system and a two-step look-ahead, as well as iterative local improvements based on the median version of the problem, incorporating simulated annealing. We apply this to a set of yeast genomes to corroborate a recent gene sequence-based phylogeny.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

CSMET: Comparative Genomic Motif Detection via Multi-Resolution Phylogenetic Shadowing

Author: A Sandelin
A Siepel
AC Siepel
AM Moses
BE Engelhardt
C Bergman
C Boutilier
CM Bergman
D Boffelli
DA Papatsenko
EH Margulies
EP Xing
Eric P. Xing
GE Crooks
GJ Olsen
I Dubchak
J Felsenstein
J Pedersen
JD McAuliffe
M Blanchette
M Hasegawa
M Tompa
MC Frith
Mladen Kolar
MR Kantorovitz
MZ Ludwig
Pradipta Ray
PV Benos
R Siddharthan
RG Cowell
S Sinha
SB Montgomery
Suyash Shringarpure
T Wang
TH Jukes
Uwe Ohler
W Huang
Publication venue: Public Library of Science
Publication date: 01/06/2008

Functional turnover of transcription factor binding sites (TFBSs), such as whole-motif loss or gain, is a common event during genome evolution. Conventional probabilistic phylogenetic shadowing methods model the evolution of genomes only at the nucleotide level, and lack the ability to capture the evolutionary dynamics of functional turnover of aligned sequence entities. As a result, comparative genomic search of non-conserved motifs across evolutionarily related taxa remains a difficult challenge, especially in higher eukaryotes, where the cis-regulatory regions containing motifs can be long and divergent; existing methods rely heavily on specialized pattern-driven heuristic search or sampling algorithms, which can be difficult to generalize and hard to interpret based on phylogenetic principles. We propose a new method: Conditional Shadowing via Multi-resolution Evolutionary Trees, or CSMET, which uses a context-dependent probabilistic graphical model that allows aligned sites from different taxa in a multiple alignment to be modeled by either a background or an appropriate motif phylogeny conditioning on the functional specifications of each taxon. The functional specifications themselves are the output of a phylogeny which models the evolution not of individual nucleotides, but of the overall functionality (e.g., functional retention or loss) of the aligned sequence segments over lineages. Combining this method with a hidden Markov model that autocorrelates evolutionary rates on successive sites in the genome, CSMET offers a principled way to take into consideration lineage-specific evolution of TFBSs during motif detection, and a readily computable analytical form of the posterior distribution of motifs under TFBS turnover. On both simulated and real Drosophila cis-regulatory modules, CSMET outperforms other state-of-the-art comparative genomic motif finders.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Accurate Detection of Recombinant Breakpoints in Whole-Genome Alignments

Author: A Rambaut
AC Siepel
Aviv Regev
D Filho
D Husmeier
G McGuire
Ian Holmes
J Archer
J Felsenstein
J Hein
JD Thompson
JP Gomes
K Lau
K Lole
LD Bowler
M Arenas
M Hasegawa
M Thomson
MJ Minichiello
N Friedman
Oscar Westesson
P Awadalla
P Puigbo
R Durbin
R Hudson
RC Edgar
RC Elston
TJ Anderson
VN Minin
YS Song
Publication venue: Public Library of Science
Publication date: 01/03/2009

We propose a novel method for detecting sites of molecular recombination in multiple alignments. Our approach is a compromise between previous extremes of computationally prohibitive but mathematically rigorous methods and imprecise heuristic methods. Using a combined algorithm for estimating tree structure and hidden Markov model parameters, our program detects changes in phylogenetic tree topology over a multiple sequence alignment. We evaluate our method on benchmark datasets from previous studies on two recombinant pathogens, Neisseria and HIV-1, as well as simulated data. We show that we are able to detect not only recombinant regions of vastly different sizes but also the location of breakpoints with great accuracy. We show that our method does well at inferring recombination breakpoints while at the same time maintaining practicality for larger datasets. In all cases, we confirm the breakpoint predictions of previous studies, and in many cases we offer novel predictions.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Using ESTs to improve the accuracy of de novo gene prediction

Author: A Krogh
AA Salamov
AC Siepel
C Wei
Chaochun Wei
DR Maglott
E Birney
I Korf
JE Allen
KD Pruitt
KL Howe
L Stein
LW Hillier
M Stanke
MG Reese
Michael R Brent
MJ van Baren
MR Brent
MS Boguski
P Flicek
R Guigo
R Guigó
R Mott
RA Gibbs
RH Brown
RH Waterston
S Foissac
SS Gross
The MGC Project Team
TW Harris
VV Solovyev
WJ Kent
Publication venue: BioMed Central
Publication date: 01/07/2006

BACKGROUND: ESTs are a tremendous resource for determining the exon-intron structures of genes, but even extensive EST sequencing tends to leave many exons and genes untouched. Gene prediction systems based exclusively on EST alignments miss these exons and genes, leading to poor sensitivity. De novo gene prediction systems, which ignore ESTs in favor of genomic sequence, can predict such "untouched" exons, but they are less accurate when predicting exons to which ESTs align. TWINSCAN is the most accurate de novo gene finder available for nematodes and N-SCAN is the most accurate for mammals, as measured by exact CDS gene prediction and exact exon prediction. RESULTS: TWINSCAN_EST is a new system that successfully combines EST alignments with TWINSCAN. On the whole C. elegans genome TWINSCAN_EST shows 14% improvement in sensitivity and 13% in specificity in predicting exact gene structures compared to TWINSCAN without EST alignments. Not only are the structures revealed by EST alignments predicted correctly, but these also constrain the predictions without alignments, improving their accuracy. For the human genome, we used the same approach with N-SCAN, creating N-SCAN_EST. On the whole genome, N-SCAN_EST produced a 6% improvement in sensitivity and 1% in specificity of exact gene structure predictions compared to N-SCAN. CONCLUSION: TWINSCAN_EST and N-SCAN_EST are more accurate than TWINSCAN and N-SCAN, while retaining their ability to discover novel genes to which no ESTs align. Thus, we recommend using the EST versions of these programs to annotate any genome for which EST information is available. TWINSCAN_EST and N-SCAN_EST are part of the TWINSCAN open source software package

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

A Human-Specific De Novo Protein-Coding Gene Associated with Human Brain Functions

Author: A Siepel
A Varki
AC Marques
B Ewing
Chuan-Yun Li
Chunmei Cao
D Gordon
D Karolchik
D Leister
D Wang
DA Nickerson
DG Knowles
DJ Begun
DL Hartl
DL Wheeler
EJ Vallender
ES Lander
F Duan
FG Wulczyn
George R. Uhl
GM Cooper
GR Uhl
J Cai
J Rozas
JP Gong
K Chen
Liping Wei
M Long
M Toll-Riera
M Wu
MT Levine
Philip E. Bourne
Ping-Wu Zhang
Qing-Rong Liu
QR Liu
Quan Du
Quan Yu
RC Gentleman
RR Hudson
S Ohno
SF Saccone
Shu-Juan Lu
ST Chen
T Barrett
UniProt_Consortium
W Peng
W Wang
Xiao-Mo Li
Xiaofeng Zheng
Yan Zhang
Yong Zhang
Z Wu
Zhanbo Wang
Publication venue: Public Library of Science
Publication date: 01/01/2010

To understand whether any human-specific new genes may be associated with human brain functions, we computationally screened the genetic vulnerable factors identified through Genome-Wide Association Studies and linkage analyses of nicotine addiction and found one human-specific de novo protein-coding gene, FLJ33706 (alternative gene symbol C20orf203). Cross-species analysis revealed interesting evolutionary paths of how this gene had originated from noncoding DNA sequences: insertion of repeat elements especially Alu contributed to the formation of the first coding exon and six standard splice junctions on the branch leading to humans and chimpanzees, and two subsequent substitutions in the human lineage escaped two stop codons and created an open reading frame of 194 amino acids. We experimentally verified FLJ33706's mRNA and protein expression in the brain. Real-Time PCR in multiple tissues demonstrated that FLJ33706 was most abundantly expressed in brain. Human polymorphism data suggested that FLJ33706 encodes a protein under purifying selection. A specifically designed antibody detected its protein expression across human cortex, cerebellum and midbrain. Immunohistochemistry study in normal human brain cortex revealed the localization of FLJ33706 protein in neurons. Elevated expressions of FLJ33706 were detected in Alzheimer's brain samples, suggesting the role of this novel gene in human-specific pathogenesis of Alzheimer's disease. FLJ33706 provided the strongest evidence so far that human-specific de novo genes can have protein-coding potential and differential protein expression, and be involved in human brain functions

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Sampling and counting genome rearrangement scenarios

Author: A Bergeron
A Caprara
A Darling
A Karzanov
A Ouangraoua
A Rajaraman
AC Siepel
B Larget
C Chauve
C Zheng
D Sankoff
DVM Braga
E Tannier
G Brightwell
Heather Smith
I Miklós
István Miklós
JS Liu
KM Swenson
L Lovász
LG Valiant
MA Alekseyev
MR Jerrum
N Metropolis
P Feijão
PL Erdős
R Durrett
R Warren
S Geman
S Hannenhalli
W Hastings
WM Fitch
Y Ajana
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

Even for moderate-size inputs, there are a tremendous number of optimal rearrangement scenarios, regardless of what the model is and which specific question is to be answered. Therefore, giving one optimal solution might be misleading and cannot be used for statistical inference. Statistically well-founded methods are necessary to sample uniformly from the solution space, and then a small number of samples is sufficient for statistical inference.

Crossref

SZTAKI Publication Repository

Big Genomes Facilitate the Comparative Identification of Regulatory Elements

Author: A Siepel
AC Groth
AC Spradling
AG Clark
B Ewing
BD Pfeiffer
BP Berman
Brant K. Peterson
CM Bergman
DA Petrov
Daniel R. Papaj
DB Jaffe
DL Gumucio
E Birney
EE Hare
Emily E. Hare
Eric Jang
G Bosco
GG Loots
J Jiang
KA Frazer
L Elnitski
LA Pennacchio
Laura Conner
LD Stein
M Blanchette
M Fujioka
M Lynch
M Markstein
MA Nobrega
Matthew W. Hahn
MD Bennett
MG Kidwell
Michael B. Eisen
MS Halfon
MT Ross
MZ Ludwig
NS Wratten
P Andolfatto
Rick Kurashima
S Batzoglou
S Fisher
S Richards
S Schwartz
S Small
SM Gallo
Steven Storage
TR Gregory
Venky N. Iyer
WJ Kent
Publication venue: Public Library of Science
Publication date: 04/03/2009

The identification of regulatory sequences in animal genomes remains a significant challenge. Comparative genomic methods that use patterns of evolutionary conservation to identify non-coding sequences with regulatory function have yielded many new vertebrate enhancers. However, these methods have not contributed significantly to the identification of regulatory sequences in sequenced invertebrate taxa. We demonstrate here that this differential success, which is often attributed to fundamental differences in the nature of vertebrate and invertebrate regulatory sequences, is instead primarily a product of the relatively small size of sequenced invertebrate genomes. We sequenced and compared loci involved in early embryonic patterning from four species of true fruit flies (family Tephritidae) that have genomes four to six times larger than those of Drosophila melanogaster. Unlike in Drosophila, where virtually all non-coding DNA is highly conserved, blocks of conserved non-coding sequence in tephritids are flanked by large stretches of poorly conserved sequence, similar to what is observed in vertebrate genomes. We tested the activities of nine conserved non-coding sequences flanking the even-skipped gene of the tephritid Ceratitis capitata in transgenic D. melanogaster embryos, six of which drove patterns that recapitulate those of known D. melanogaster enhancers. In contrast, none of the three non-conserved tephritid non-coding sequences that we tested drove expression in D. melanogaster embryos. Based on the landscape of non-coding conservation in tephritids, and our initial success in using conservation in tephritids to identify D. melanogaster regulatory sequences, we suggest that comparison of tephritid genomes may provide a systematic means to annotate the non-coding portion of the D. melanogaster genome. We also propose that large genomes be given more consideration in the selection of species for comparative genomics projects, to provide increased power to detect functional non-coding DNAs and to provide a less biased view of the evolution and function of animal genomes.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

De Novo Origin of Human Protein-Coding Genes

Author: A Siepel
AC Marques
AM Gontijo
C-Y Li
D Karolchik
D Li
D Pan
David J. Begun
David M. Irwin
DG Knowles
DJ Begun
Dong-Dong Wu
E Betrán
E Bornberg-Bauer
ER Kandel
ET Wang
EW Deutsch
F Jacob
H Kaessmann
J Cai
J Wang
JD Thompson
JJ Cai
JR McCarrey
KC Kleene
L Lin
M Cáceres
M Long
M Lynch
MA Bakewell
MT Levine
P Jones
P Khaitovich
PC Sabeti
Q Pan
Q Zhou
RF Yeh
RS Hill
S Ohno
ST Chen
T Giger
TJAJ Heinen
W Xiao
Y Xiong
Ya-Ping Zhang
Z Wang
Z Yang
Publication venue: Public Library of Science
Publication date: 01/11/2011

The de novo origin of a new protein-coding gene from non-coding DNA is considered to be a very rare occurrence in genomes. Here we identify 60 new protein-coding genes that originated de novo on the human lineage since divergence from the chimpanzee. The functionality of these genes is supported by both transcriptional and proteomic evidence. RNA–seq data indicate that these genes have their highest expression levels in the cerebral cortex and testes, which might suggest that these genes contribute to phenotypic traits that are unique to humans, such as improved cognitive ability. Our results are inconsistent with the traditional view that the de novo origin of new genes is very rare; thus, there should be greater appreciation of the importance of the de novo origination of genes.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Comparative Analysis of Human Protein-Coding and Noncoding RNAs between Brain and 10 Mixed Cell Lines by RNA-Seq

Author: A Huttenhofer
A Siepel
AA Aravin
AC Marques
B Langmead
B Li
Bing He
C Garofalo
C Trapnell
CA Brosnan
CC Babbitt
CZ Han
DL Black
DM Cork
E Birney
ET Wang
FF Costa
Geng Chen
I Martianov
J Feng
JE Wilusz
Jian Luo
Jürgen Brosius
K Fejes-Toth
K Hashimoto
K Laud
Kangping Yin
L Kong
L Shi
Leming Shi
M Griffith
M Guttman
M Ishikawa
M Mallardo
M Sammeth
M Wrage
Mingyao Liu
N Novoradovskaya
P Carninci
Peng Li
PJ French
R Klinck
R Louro
RA Gupta
S Marguerat
SW Blume
Tieliu Shi
TR Mercer
U Nagalakshmi
UA Orom
V Pedraza
X Cai
Y Lee
Y Okazaki
Ya Qi
Yuanzhang Fang
Z Wang
Publication venue: Public Library of Science
Publication date: 30/11/2011

In their expression process, different genes can generate diverse functional products, including various protein-coding or noncoding RNAs. Here, we investigated the protein-coding capacities and the expression levels of isoforms of known human genes, as well as the conservation and disease association of long noncoding RNAs (ncRNAs), using two transcriptome sequencing datasets from human brain tissues and 10 mixed cell lines. Comparative analysis revealed that about two-thirds of the genes expressed in brain and cell lines are the same, but less than one-third of their isoforms are identical. Besides those genes specifically expressed in brain or in cell lines, about 66% of the genes expressed in common encoded different isoforms. Moreover, most genes dominantly expressed one isoform, and some genes generated protein-coding (or noncoding) RNAs in one sample but not in the other. We found that 282 human genes could encode both protein-coding and noncoding RNAs through alternative splicing in the two samples. We also identified more than 1,000 long ncRNAs, most of which contain elements conserved across 46 vertebrates, 33 placental mammals, or 10 primates. Further analysis showed that some long ncRNAs were differentially expressed in human breast cancer or lung cancer, and several of these differentially expressed long ncRNAs were validated by RT-PCR. In addition, the validated differentially expressed long ncRNAs were found to be significantly correlated with certain breast cancer or lung cancer related genes, indicating an important biological relationship between long ncRNAs and human cancers. Our findings reveal that differences in gene expression profiles between samples mainly result from the expressed gene isoforms, and highlight the importance of studying genes at the isoform level to fully characterize the intricate transcriptome.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

A Comprehensive Genetic Analysis of Candidate Genes Regulating Response to Trypanosoma congolense Infection in Mice

About one-third of cattle in sub-Saharan Africa are at risk of contracting “Nagana”—a disease caused by Trypanosoma parasites similar to those that cause human “Sleeping Sickness.” Laboratory mice can also be infected by trypanosomes, and different mouse breeds show varying levels of susceptibility to infection, similar to what is seen between different breeds of cattle. Survival time after infection is controlled by the underlying genetics of the mouse breed, and previous studies have localised three genomic regions that regulate this trait. These three “Quantitative Trait Loci” (QTL), which have been called Tir1, Tir2 and Tir3 (for Trypanosoma Infection Response 1–3) are well defined, but nevertheless still contain over one thousand genes, any number of which may be influencing survival. This study has aimed to identify the specific differences associated with genes that are controlling mouse survival after T. congolense infection. We have applied a series of analyses to existing datasets, and combined them with novel sequencing, and other genetic data to create short lists of genes that share polymorphisms across susceptible mouse breeds, including two promising “candidate genes”: Pram1 at Tir1 and Cd244 at Tir3. These genes can now be tested to confirm their effect on response to trypanosome infection

University of Liverpool Repository

Public Library of Science (PLOS)

Research UNE

University of Salford Institutional Repository

Crossref

Directory of Open Access Journals

PubMed Central

Edinburgh Research Explorer

CGSpace

The University of Manchester - Institutional Repository